Figure 7: Slow Down Gesture.
Figure 8: Prepare to Move Gesture.
Figure 10: Stop Gesture.
Figure 11: Right or Left Turn Gestures.
Figure 12: "Okay" Gesture.
Rule for q to Bring the Error to Zero.
Line.
Figure 18: The Recursive Linear Least Squares Method for Updating q with Each Additional (xi,yi) Data Point.
Figure 19: An Exaggerated Representation of the Residual Error Measurement.
Figure 20: An Algorithm for Determining the Specific Gesture Model.
Figure 23: Bounding Box Around Hand.
Figure 24: Descriptions from Bounding Box.
Figure 25: The Example Gestures.
Figure 27: Flowchart of the CTS.
Figure 30: Color Matching Technique.

GESTURE-CONTROLLED INTERFACES
FOR SELF-SERVICE MACHINES AND 
OTHER APPLICATIONS 

REFERENCE TO RELATED APPLICATIONS 

This application claims priority of U.S. provisional patent
application Ser. No. 60/096,126, filed Aug. 10, 1998, the
entire contents of which are incorporated herein by reference.

STATEMENT 

This invention was made with Government support under
contracts NAS9-98068 (awarded by NASA), DASW01-98-M-0791
(awarded by the U.S. Army), and F29601-98-C-0096
(awarded by the U.S. Air Force). The Government has
certain rights in this invention.

FIELD OF THE INVENTION 

This invention relates to person-machine interfaces and, 
in particular, to gesture-controlled interfaces for self-service 
machines and other applications. 

BACKGROUND OF THE INVENTION 

Gesture recognition has many advantages over other input 
means, such as the keyboard, mouse, speech recognition, 
and touch screen. The keyboard is a very open ended input 
device and assumes that the user has at least a basic typing 
proficiency. The keyboard and mouse both contain moving 
parts. Therefore, extended use will lead to decreased per- 
formance as the device wears down. The keyboard, mouse, 
and touch screen all need direct physical contact between the 
user and the input device, which could cause the system 
performance to degrade as these contacts are exposed to the 
environment. Furthermore, there is the potential for abuse 
and damage from vandalism to any tactile interface which is 
exposed to the public. 

Tactile interfaces can also lead to hygiene problems, in that
the system may become unsanitary or unattractive to users, 
or performance may suffer. These effects would greatly 
diminish the usefulness of systems designed to target a wide 
range of users, such as advertising kiosks open to the general 
public. This cleanliness issue is very important for the touch 
screen, where the input device and the display are the same 
device. Therefore, when the input device is soiled, the 
effectiveness of the input and display decreases. Speech 
recognition is very limited in a noisy environment, such as 
sports arenas, convention halls, or even city streets. Speech 
recognition is also of limited use in situations where silence 
is crucial, such as certain military missions or library card 
catalog rooms. 

Gesture recognition systems do not suffer from the prob- 
lems listed above. There are no moving parts, so device wear 
is not an issue. Cameras, used to detect features for gesture 
recognition, can easily be built to withstand the elements and 
stress, and can also be made very small and used in a wider 
variety of locations. In a gesture system, there is no direct 
contact between the user and the device, so there is no 
hygiene problem. The gesture system requires no sound to 
be made or detected, so background noise level is not a 
factor. A gesture recognition system can control a number of 
devices through the implementation of a set of intuitive 
gestures. The gestures recognized by the system would be 
designed to be those that seem natural to users, thereby 
decreasing the learning time required. The system can also 
provide users with symbol pictures of useful gestures similar 
to those normally used in American Sign Language books. 




Simple tests can then be used to determine what gestures are 
truly intuitive for any given application. 

For certain types of devices, gesture inputs are the more
practical and intuitive choice. For example, when controlling
a mobile robot, basic commands such as "come here",
"go there", "increase speed", "decrease speed" would be
most efficiently expressed in the form of gestures. Certain
environments gain a practical benefit from using gestures.
For example, certain military operations have situations
where keyboards would be awkward to carry, or where
silence is essential to mission success. In such situations,
gestures might be the most effective and safe form of input.

A system using gesture recognition would be ideal as an
input device for self-service machines (SSMs) such as
public information kiosks and ticket dispensers. SSMs are
rugged and secure cases approximately the size of a phone
booth that contain a number of computer peripheral
technologies to collect and dispense information and services. A
typical SSM system includes a processor, input device(s)
(including those listed above), and video display. Many
SSMs also contain a magnetic card reader, image/document
scanner, and printer/form dispenser. The SSM system may
or may not be connected to a host system or even the
Internet.

The purpose of SSMs is to provide information, or to
dispense objects, without the traditional constraints of
traveling to the source of information and being frustrated
by limited manned office hours. One SSM can host several
different applications providing access to a number of
information/service providers. Eventually, SSMs could be
the solution for providing access to the information
contained on the World Wide Web to the majority of a
population which currently has no means of accessing the
Internet.

SSMs are based on PC technology and have a great deal
of flexibility in gathering and providing information. In the
next two years SSMs can be expected to follow the
technology and price trends of PCs. As processors become
faster and storage becomes cheaper, the capabilities of SSMs
will also increase.

Currently SSMs are being used by corporations,
governments, and colleges. Corporations use them for many
purposes, such as displaying advertising (e.g. previews for a
new movie), selling products (e.g. movie tickets and
refreshments), and providing in-store directories. SSMs are
deployed performing a variety of functions for federal, state,
and municipal governments. These include providing motor
vehicle registration, gift registries, employment information,
near-real time traffic data, information about available
services, and tourism/special event information. Colleges
use SSMs to display information about courses and campus
life, including maps of the campus.

SUMMARY OF THE INVENTION 

The subject invention resides in gesture recognition
methods and apparatus. In the preferred embodiment, a gesture
recognition system according to the invention is engineered
for device control, and not as a human communication
language. That is, the apparatus preferably recognizes
commands for the express purpose of controlling a device
such as a self-service machine, regardless of whether the
gestures originated from a live or inanimate source. The
system preferably not only recognizes static symbols, but
dynamic gestures as well, since motion gestures are
typically able to convey more information.

In terms of apparatus, a system according to the invention
is preferably modular, and includes a gesture generator,
sensing system, modules for identification and transformation
into a command, and a device response unit. At a high
level, the flow of the system is as follows. Within the field
of view of one or more standard video cameras, a gesture is
made by a person or device. During the gesture making
process, a video image is captured, producing image data
along with timing information. As the image data is
produced, a feature-tracking algorithm is implemented
which outputs position and time information. This position
information is processed by static and dynamic gesture
recognition algorithms. When the gesture is recognized, a
command message corresponding to that gesture type is sent
to the device to be controlled, which then performs the
appropriate response.

The system only searches for static gestures when the
motion is very slow (i.e. the norm of the x and y, and z,
velocities is below a threshold amount). When this
occurs, the system continually identifies a static gesture or
outputs that no gesture was found. Static gestures are
represented as geometric templates for commonly used
commands such as Halt, Left/Right Turn, "OK," and Freeze.
Language gestures, such as the American Sign Language,
can also be recognized. A file of recognized gestures, which
lists named gestures along with their vector descriptions, is
loaded in the initialization of the system. Static gesture
recognition is then performed by identifying each new
description. A simple nearest neighbor metric is preferably
used to choose an identification. In recognizing static human
hand gestures, the image of the hand is preferably localized
from the rest of the image to permit identification and
classification. The edges of the image are preferably found
with a Sobel operator. A box which tightly encloses the hand
is also located to assist in the identification.

Dynamic (circular and skew) gestures are preferably
treated as one-dimensional oscillatory motions. Recognition
of higher-dimensional motions is achieved by independently
recognizing multiple, simultaneously created
one-dimensional motions. A circle, for example, is created by
combining repeating motions in two dimensions that have
the same magnitude and frequency of oscillation, but
wherein the individual motions are ninety degrees out of phase.
A diagonal line is another example. Distinct circular
gestures are defined in terms of their frequency rate; that is,
slow, medium, and fast.

Additional dynamic gestures are derived by varying phase
relationships. During the analysis of a particular gesture, the
x and y minimum and maximum image plane positions are
computed. Z position is computed if the system is set up for
three dimensions. If the x and y motions are out of phase, as
in a circle, then when x or y is minimum or maximum, the
velocity along the other is large. The direction
(clockwiseness in two dimensions) of the motion is
determined by looking at the sign of this velocity component.
Similarly, if the x and y motion are in phase, then at these
extremum points both velocities are small. Using clockwise
and counter-clockwise circles, diagonal lines,
one-dimensional lines, and small and large circles and lines, a
twenty-four gesture lexicon was developed and described
herein. A similar method is used when the gesture is
performed in three dimensions.

An important aspect of the invention is the use of
parameterization and predictor bins to determine a gesture's future
position and velocity based upon its current state. The bin
predictions are compared to the next position and velocity of
each gesture, and the difference between the bin's prediction
and the next gesture state is defined as the residual error.
According to the invention, a bin predicting the future state
of a gesture it represents will exhibit a smaller residual error
than a bin predicting the future state of a gesture that it does
not represent. For simple dynamic gesture applications, a
linear-with-offset-component model is preferably used to
discriminate between gestures. For more complex gestures,
a variation of a velocity damping model is used.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a drawing of a gesture recognition system
according to the invention;
FIG. 2 is a gesture recognition system flow chart;
FIG. 3 is a signal flow diagram of a gesture recognition
system according to the invention;
FIG. 4 is a drawing which shows example gestures in two
dimensions;
FIG. 5 shows three example gestures;
FIG. 6 is an example of a 24-gesture lexicon according to
the invention;
FIG. 7 depicts a Slow-Down gesture;
FIG. 8 depicts a Move gesture;
FIG. 9 depicts an Attention gesture;
FIG. 10 depicts a Stop gesture;
FIG. 11 shows Right/Left Turn gestures;
FIG. 12 shows an "Okay" gesture;
FIG. 13 shows a Freeze gesture;
FIG. 14 provides three plots of a human created one
dimensional X-Line oscillating motion;
FIG. 15 shows possible lines associated with x(t,p)=p0+p1t
and their equivalent representation in the p parameter
space;
FIG. 16 illustrates parameter fitting wherein a rule is used
for q to bring the error to zero;
FIG. 17 plots different (xi,yi) data points resulting in a
different best fitting q line;
FIG. 18 depicts a recursive linear least squares method for
updating q with subsequent (xi,yi) data points;
FIG. 19 illustrates an algorithm for determining a specific
gesture model according to the invention;
FIG. 20 is an exaggerated representation of a residual
error measurement;
FIG. 21 is a plot which shows worst case residual ratios
for each gesture model, wherein the lower the ratio, the
better the model;
FIG. 22 illustrates how two perpendicular oscillatory line
motions may be combined into a circular gesture;
FIG. 23 shows how a bounding box may be placed around
a hand associated with a gesture;
FIG. 24 provides descriptions from the bounding box of
FIG. 23;
FIG. 25 shows example gestures;
FIG. 26 is a schematic of hand-tracking system hardware
according to the invention;
FIG. 27 is a flowchart of a color tracking system (CTS)
according to the invention;
FIG. 28 depicts a preferred graphical user interface of the
CTS;
FIG. 29 illustrates the application of target center from
difference image techniques;
FIG. 30 illustrates a color matching technique;
FIG. 31 is a representation of an identification module;
and
FIG. 32 is a simplified diagram of a dynamic gesture
prediction module according to the invention.

DETAILED DESCRIPTION OF THE
INVENTION

FIG. 1 presents a system overview of a gesture controlled
self service machine system according to the invention. FIG.
2 shows a flow chart representation of how a vision system
views the gesture created, with the image data sent to the
gesture recognition module, translated into a response, and
then used to control a SSM, including the display of data, a
virtual environment, and devices. The gesture recognition
system takes the feature positions of the moving body parts
(two or three dimensional space coordinates, plus a time
stamp) as the input as quickly as the vision system can output
the data, and outputs what gesture (if any) was recognized,
again at the same rate as the vision system outputs data.

The specific components of the gesture recognition
system are detailed in FIG. 3, and these include five modules:
G: Gesture Generation
S: Sensing (vision)
I: Identification Module
T: Transformation
R: Response

At a high level, the flow of the system is as follows.
Within the field of view of one or more standard video
cameras, a gesture is made by a person or device. During the
gesture making process, a video capture card is capturing
images, producing image data along with timing
information. As the image data is produced, it is run through a
feature tracking algorithm which outputs position and time
information. This position information is processed by static
and dynamic gesture recognition algorithms. When the
gesture is recognized, a command message corresponding to
that gesture type is sent to the device to be controlled, which
then performs an appropriate response. The five modules
are detailed below.

Gesture Creator
In the Gesture Creator module, a human or device creates
a spatial motion to be recognized by the sensor module. If
one camera is used, then the motion generated is two
dimensional and parallel to the image plane of the
monocular vision system. For three dimensional tracking (as is also
done with this system), stereo vision using two or more
cameras is used.

The subject gesture recognition system is designed to
recognize consistent yet non-perfect motion gestures and
non-moving static gestures. Therefore, a human can create
such gestures, as well as an actuated mechanism which
could repeatedly create perfect gestures. Human gestures are
more difficult to recognize due to the wide range of motions
that humans recognize as the same gesture. We designed our
gesture recognition system to recognize simple Lissajous
gesture motions (repeating circles and lines), repeated
complex motions (such as "come here" and "go away quickly"
back and forth hand motions which we define as "skew"
gestures), and static hand symbols (such as "thumbs-up").

With regards to human generated gestures used for
communication or device control, we chose gestures to be
identified based on the following:
Humans should be able to make the gestures easily.
The gestures should be easily represented mathematically.
The lexicon should match useful gestures found in real
world environments.

For the dynamic (circular and skew) gestures, these
consist of one-dimensional oscillations, performed
simultaneously in two or three dimensions. A circle is such a
motion, created by combining repeating motions in two
dimensions that have the same magnitude and frequency of
oscillation, but with the individual motions ninety degrees
out of phase. A "diagonal" line is another such motion. We
have defined three distinct circular gestures in terms of their
frequency rates: slow, medium, and fast. An example set of
such gestures is shown in FIG. 4. These gestures can also be
performed in three dimensions, and such more complex
motions can be identified by this system.

The dynamic gestures are represented by a second order
equation, one for each axis:

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = \theta_1 x_1 + \theta_2$$

More complex second-order models are used to recognize
more complex gestures (discussed later). This gesture model
has no "size" parameter. $\theta_1$ is a frequency measure, and
$\theta_2$ is a drift component. The gestures were named "large",
"small", "fast", and "slow" due to the human motions used
to determine the parameters (see FIG. 5). A fast small circle
is used to represent a fast oscillation because humans cannot
make fast oscillations using large circles.

For example, a total of twenty four gestures are possible
when the following are distinct gestures: clockwise and
counter-clockwise circles, diagonal lines, one dimensional
lines, and small and large circles and lines. Geometric
constraints are required to expand the lexicon, because
different gestures can result in the same parameters. FIG. 6
shows motions that would cause an identifier to produce the
same frequency measure and drift components as it would
produce when identifying a slow large circle. When x and y
oscillating motions are 90 degrees out of phase, a clockwise
circle is produced. Motions that are 270 degrees out of phase
result in a counter-clockwise circle. In-phase motions
produce a line with a positive slope. When the motions are 180
degrees out of phase, a line with a negative slope is
produced. We can create additional gestures from the fast
small circle in the same manner.

As with the previous gestures, additional gestures can be
created from these two gestures by varying the phase
relationships. FIG. 6 shows a representation of the 24
gestures in a possible lexicon. Even more gestures are
possible when the third dimension is used.

Phase relationships are determined as follows. During the
gesture, the x's and y's (and z's, if the system is set up for
three dimensions) minimum and maximum image plane
positions are computed. If the x and y motions are out of
phase, as in a circle, then when x or y is minimum or
maximum, the other axis's velocity is large. The direction
(clockwiseness in two dimensions) of the motion is
determined by looking at the sign of this velocity component.
Similarly, if the x and y motion are in phase, then at these
extremum points both velocities are small.

Example dynamic gestures used for real world situations
were derived from a standard Army Training Manual. A
"Slow Down" gesture is a small x-line created to one side of
the body (FIG. 7, left side). A "Day Move" gesture is a
counterclockwise large slow circle (FIG. 8, left side). The
"Attention" gesture is a large y-line overhead motion (FIG.
9). These three gestures are representative of the motion
gestures used throughout the Army manual.

Static gestures are represented as geometric templates.
Four gestures are shown and are representative of the static
gestures which can be represented and identified by this
gesture recognition system. Additionally, language gestures,
such as American Sign Language gestures, can also be
recognized.
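
The phase relationships described for two-axis oscillatory gestures (90 degrees for a clockwise circle, 270 degrees for a counter-clockwise circle, in phase for a positive-slope line, 180 degrees for a negative-slope line) can be sketched as a simple lookup. This is a minimal illustration only: the function name `classify_phase` and the tolerance `tol_deg` are assumptions made here and do not appear in the specification, which determines phase from velocities at the position extrema rather than from a measured phase angle.

```python
def classify_phase(phase_deg, tol_deg=30.0):
    """Map the x/y phase difference (degrees) to a gesture shape.

    The four nominal phases come from the text; tol_deg is an
    assumed tolerance, not a value given in the specification.
    """
    shapes = {
        0.0: "line, positive slope",
        90.0: "clockwise circle",
        180.0: "line, negative slope",
        270.0: "counter-clockwise circle",
    }
    phase = phase_deg % 360.0
    for nominal, name in shapes.items():
        diff = abs(phase - nominal)
        diff = min(diff, 360.0 - diff)  # wrap-around distance on the circle
        if diff <= tol_deg:
            return name
    return "unrecognized"
```

Combining the shape with a size label (small or large) and a frequency label is one way a lexicon of distinct gestures could be enumerated.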
The example static gestures are:

Halt— stop hand above head (FIG. 10— left side of 
figure). 

Left and Right turn — fingers together, palm out, facing 
left or right (FIG. 11— left side of figure). 

Message Acknowledge (OK) — thumb up (FIG. 12). 

Freeze — Fist at head level (FIG. 13). 
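
As an illustration of how static gestures stored as named vector descriptions might be identified with a nearest neighbor metric, the following sketch may help. It is not the patented implementation: the template vectors, the Euclidean distance, and the names `is_slow_enough`, `nearest_gesture`, and `max_distance` are all assumptions introduced here.

```python
import math

# Hypothetical template file contents: gesture name -> vector description.
TEMPLATES = {
    "Halt": (1.0, 0.2, 0.9),
    "OK": (0.3, 0.8, 0.1),
    "Freeze": (0.5, 0.5, 0.5),
}

def is_slow_enough(vx, vy, threshold):
    """Static gestures are only searched for when the velocity norm is small."""
    return math.hypot(vx, vy) < threshold

def nearest_gesture(description, templates, max_distance=None):
    """Return the name of the closest template, or None if nothing is close."""
    best_name, best_dist = None, math.inf
    for name, vec in templates.items():
        dist = math.dist(description, vec)  # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if max_distance is not None and best_dist > max_distance:
        return None  # corresponds to reporting that no gesture was found
    return best_name
```

A caller would first check `is_slow_enough` on the tracked feature velocity, then pass the description computed from the bounding box of the hand to `nearest_gesture`.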
Identifying Moving Gestures Represented as a Dynamic 
System 

The gesture recognition system identifies a moving ges- 
ture by its dynamics — that is, the structure of its positions in 
space over time. The system translates the motion informa- 
tion into parameters which are used to develop commands 
for controlling data outputs and actuated mechanisms. For 
example, the speed at which a person waves a robot away 
might directly affect a robot arm's velocity or a mobile 
robot's speed. In order for recognition to occur, a represen- 
tation for human gestures is required, from which a com- 
putational method for determining and recognizing specific 
gestures can be derived. 

Although we make these gestures in two and three
dimensions, the explanation that follows uses a basic
one-dimensional gesture as a simple example to clarify the
distinction between the "shape" and the "dynamics" of a
gesture. The techniques for identifying this basic gesture may
be used to identify similar oscillatory motions occurring in
two and three dimensions.

First, a dynamic system gesture representation was
determined: both the model for representing the oscillatory
gestures and the parameter determination scheme were
developed. For this system, a Linear Least Squares method
provided an on-line, computationally efficient technique which
allowed us to use a linear-in-parameters gesture model.

The representative planar gesture used throughout this 
section to exemplify our method consists of a family of 
oscillating motions which form a (roughly) horizontal line 
segment ("x-line motion"). As discussed earlier, a human is 
incapable of reliably generating a perfect sinusoidal motion. 
FIG. 14 illustrates the imperfections of a human created 
x-line motion viewed in three plots. The plots represent the 
position of the gesture over time, x(t). Viewing position with 
respect to time in contrast to position and velocity over time 
provides insight into how we propose to represent gestures. 
Plot A (leftmost) shows the planar motion in x-position and 
y-position coordinates, with the gesture's motion con- 
strained to the x-axis. Thus, the "shape" of the motion 
conveys relatively little information. Plot B (center) shows 
the same gesture in x-position plotted against time, empha- 
sizing the oscillatory behavior we wish to capture. Plot C (at 
right) represents the record of x-velocity plotted against 
x-position over time. We will find it most convenient to 
represent this motion as it evolves over time in this position 
versus velocity space, which is called the "phase plane". Of 
course, when a human creates a gesture, the resulting motion 
does not translate into the perfect sinusoid of plot B or a 
perfect circle of plot C. Instead, there is a natural range of 
variation that we would nevertheless like to associate with 
the same gesture. This association we find most naturally 
achievable in phase space. 

For this dynamic gesture recognition module, a
computationally effective mathematical representation for the
gesture plotted in FIG. 14 is required. A general representation
for time functions might take the form

$$x(t) = ?$$

where "?" would be replaced with some structure based on
measurable features which are used to classify the gesture.
Of course, there are an infinite number of possible
measurable features.
We can make the number of classifications (the "feature
space" dimension) finite by restricting the form of the
representations. Instead of representing gestures as x(t), the
representation might be constrained through the use of a
parameter vector, resulting in x(t,p). The feature space
dimension is then equivalent to the number of parameters we
store. For example, when:

$$x(t,p) = p_0 + p_1 t$$

the only possible gestures that we can represent are lines
described by the two parameters slope, $p_1$, and intercept, $p_0$
(see FIG. 15).

Even with a finite dimensional representation, each
unique motion is represented by its own distinct parameters.
However, our intuition about human gestures tells us that
certain distinct motions should have the same classification.
Consider the x-line oscillating gesture discussed earlier.
Whether the gesture starts at the left side of the line or the
right side (for example, x(0) = -1 or x(0) = +1), the resulting
motions would still be identified by a human as the same
gesture. Therefore, another type of representation seems
desirable.

Since a human hand forms a gesture, we could imagine a
representation in terms of the force exerted by the person's
arm muscles. Alternatively, we might imagine representing
the gesture as a function of the nerve impulses that travel
from the brain to the arm's muscles. However, quite clearly,
most of the countless types of such "internal"
representations are presently impossible to quantify in any useful
manner.

Over three hundred years ago, Newton developed a
parsimonious representation of physical motions based on their
dynamic properties,

$$\dot{x}(t) = f(x)$$

A dynamic system is a mathematical model describing the
evolution of all possible states in some state space as a
function of time. The set of all possible states is a state space.
Given an initial state, the set of all subsequent states as it
evolves over time is a "trajectory" or "motion". For any
initial condition, the future evolution of the states in a
trajectory remains within that trajectory (the trajectory is an
invariant set). Thus, all that is required to describe a
particular spatial motion is the differential equation
representation and its initial conditions. We use a deterministic
representation, as opposed to a stochastic one, because we
believe these oscillatory motions are best represented by
sine waves or a sum of exponentials as opposed to
characteristics based on statistical properties.

As with the geometric representation, there are an infinite
number of gesture classifications of the form $\dot{x}(t) = f(x)$.
However, as before, we can choose a vector of tunable
parameters to make the number of gesture classifications
finite. Such a representation has the form:

$$\dot{x}(t) = f(x, \theta)$$

where $\theta$ represents the tunable parameters. Fixing the value
of $\theta$ in a given representation yields a unique set of motions,
with different initial conditions, described by $\dot{x}(t) = f(x, \theta)$.
Motivated by the way humans interpret gestures, we
associate an entire set of motions with one specific gesture. Thus,
choosing different values of $\theta$ in a given representation
results in a "family" of trajectory sets, a "gesture family."
For example, consider an oscillatory line gesture, the motion
of which is constrained to the x-axis. This gesture can be
represented in the following two-dimensional state space:

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = \theta_1 x_1$$

where $x_1$ represents the position of the gesture, $x_2$ is its
velocity, and $\theta_1$ is a specified negative parameter. For any
constant $\theta_1 < 0$, all trajectories satisfy
$-\theta_1 x_1^2 + x_2^2 = \text{const}$, as
can be seen by direct differentiation.
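
The direct differentiation can be written out in one line. The derivation below assumes the state equations are $\dot{x}_1 = x_2$ and $\dot{x}_2 = \theta_1 x_1$, which is the form the stated invariant implies:

```latex
\frac{d}{dt}\left(-\theta_1 x_1^2 + x_2^2\right)
  = -2\theta_1 x_1 \dot{x}_1 + 2 x_2 \dot{x}_2
  = -2\theta_1 x_1 x_2 + 2 x_2 \left(\theta_1 x_1\right)
  = 0 .
```

Because $\theta_1$ is negative, writing $\omega^2 = -\theta_1$ shows that each trajectory is an ellipse $\omega^2 x_1^2 + x_2^2 = \text{const}$ in the phase plane, traversed as $x_1(t) = A\cos(\omega t + \phi)$. Different starting points on the same ellipse therefore correspond to the same gesture with different initial conditions.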

A specific gesture may be considered as a family of sets
of trajectories. A human can start the gesture at any point
(initial condition) in its trajectory, and the gesture should
still be identified as the same oscillating line.
We represent a given family of gestures (family of sets of
trajectories) by a mathematical model which contains a finite
number of tunable parameters. A mathematical model
described by differential equations, as above, allows the
development of a computational scheme that will determine
which parameters, the values of the $\theta_i$'s, correspond to a
specific gesture. The set of all valid parameters is the parameter
space. The parameter space defines the family of gestures
which can be represented by the model. In order to
categorize a finite number of gestures in this family and to permit
further variability in the exact motions associated with a
particular gesture within this family, we partition the
parameter space into a finite number of cells (the "lexicon") and
associate all the parameter values in the same cell with one
gesture.
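
Partitioning a one-parameter space into lexicon cells can be sketched as follows. This is only an illustration: the boundary values on $\theta_1$ and the labels are hypothetical, chosen here to mirror the slow, medium, and fast frequency classes mentioned in the text, and the name `lexicon_cell` is not from the specification.

```python
from bisect import bisect_right

# Hypothetical cell boundaries on theta_1. More negative values mean a
# higher oscillation frequency, since omega = sqrt(-theta_1).
BOUNDARIES = [-40.0, -10.0]
LABELS = ["fast", "medium", "slow"]

def lexicon_cell(theta1):
    """Return the label of the parameter-space cell containing theta1."""
    return LABELS[bisect_right(BOUNDARIES, theta1)]
```

Every parameter value inside one cell maps to the same gesture, which is what allows some variability in how a person performs it.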

We have derived certain differential equations, composed
of state variables and parameters, which intuition suggests
may represent human gestures. Such differential equation
models can be divided into two types: non-linear-in-parameters
(NLIP) and linear-in-parameters (LIP). The two
models can be further subdivided into linear-in-state (LIS)
and non-linear-in-state (NLIS). It is advantageous to use a
NLIP (with NLIS) model because it covers, by definition, a
much broader range of systems than an LIP model.
However, for reasons to be discussed below, we find it
expedient to use a LIP model for our gesture representation.

We have chosen to represent planar oscillatory gestures as
a second-order system, believing that a model based on the
acceleration behavior (physical dynamics) of a system is
sufficient to characterize the oscillatory gestures in which we
are interested. This system's states are position and velocity.
However, the vision system we use to sense gestures yields
only position information. Since velocity is not directly
measured, either the parameter identification method
could be combined with a technique for observing the
velocity, or the velocity could be determined through position
differences. In the following section we show techniques
for determining gesture parameters both when the
velocity state is observed, and when it is obtained through
position differences. By examining the utility of each
technique, we develop an appropriate form of the gesture
model and parameter identification method.
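
The position-difference option mentioned above can be sketched as follows. This is a minimal illustration, not the method of the specification: the function name `velocities_from_positions`, the backward-difference scheme, and the sample period `dt` are all assumptions made for the example.

```python
def velocities_from_positions(positions, dt):
    """Estimate velocity by backward differences of sampled positions.

    positions: list of image-plane positions along one axis.
    dt: sample period in seconds (assumed constant).
    The first sample has no predecessor, so one fewer value is returned.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return [(positions[k] - positions[k - 1]) / dt
            for k in range(1, len(positions))]


# Illustrative data only: four position samples taken 0.1 s apart.
positions = [0.0, 0.1, 0.3, 0.6]
velocities = velocities_from_positions(positions, dt=0.1)
```

Because each estimate divides a small difference by a small `dt`, measurement noise in the positions is amplified, which is the drawback of differencing noted in the text.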

A difficulty with using human created gestures is that the
underlying true physical model is unknown. Also, because
people cannot precisely recreate even a simple circular
gesture, multiple sets of parameters could represent the same
gesture. Simulations are used both to determine a viable
gesture model and to determine if it is possible to discover
appropriate parameters for each gesture despite variations in
motion.

We chose to represent motion gestures using dynamic
systems. We next determined a model and a method for
computing the model's parameters such that the model's
parameters will best match an observed gesture motion. FIG.
16 illustrates how the gesture's position is used as an input,
with $\hat{\theta}$ representing the unknown parameter values that we
wish to match with the "true" parameter values, $\theta$. If these
values match, then the error between the true states $x$ and the
observed states $\hat{x}$ will go to zero.

Our choice of a model and parameter determination 
scheme was based on an exploration of the following issues: 
Off-line batch techniques versus on-line sequential tech- 
niques. We desire our gesture recognition system to 
identify gestures as they are generated, which requires 
an on-line technique. Also, the measure of how well a 
motion matches a gesture's parameters needs to be 
updated "on-line". 
State availability. Using a vision system to sense gestures 
results in image plane position information. However, 
we are using a second order system to describe ges- 
tures. Therefore, we need both positions and velocities 
for our residual error measurements (see below). Veloc- 
ity can be obtained through the use of an estimator or 
by taking a difference of position measurements. 
Unfortunately, using differences adds noise to the data, 
which could make parameter identification difficult. 
Data order dependent versus independent (for on-line
techniques). Certain on-line techniques will produce 
different parameter values based on the order the ges- 
ture data is presented. Because we define a gesture as 
a family of trajectories, with each trajectory in the same 
family equally valid, our method should be data order 
independent. In particular, different excursions through 
the same data set should result in the same parameters 
at the end of the data acquisition phase. 
Linear versus Non-Linear. A model is a combination of 
linear and non-linear states and parameters. Although 
perfect (non human created) circular oscillatory 
motions can be described by a linear-in-parameters and 
linear-in-states model, a human created gesture may
require a more complex model. Furthermore, our sys- 
tem can recognize more complex oscillatory motions. 
Therefore, a method for identifying parameters in a 
richer non-linear model is needed, because non-linear 
models can represent a much broader range of motions. 
We chose our gesture model and parameter determination 
scheme as follows. First, we decided to de-emphasize off- 
line batch techniques in favor of on-line ones for reasons 
already discussed above. The on-line method needs to be 
chosen carefully, because there are relatively few cases 
where it can be guaranteed that the estimated parameters 
will be equivalent to those resulting from off-line techniques 
applied to the entire data set as a whole. 

Next, in an attempt to use only position data, we examined 
a Series-Parallel Observer, which provides an estimate of the 
other unknown state for purely LIS and LIP systems. We 
were disappointed by this observer because it did not 
adequately estimate parameters of non-perfect human ges- 
tures. Specifically, it was problematic to extend the method 
to NLIS systems. An on-line gradient descent method was 
examined, but for presently available methods applicable to 
NLIP systems, there is no guarantee that the parameters will 
converge towards their optimal values. Also, the parameters 
computed via this method are dependent on the order the 
data is presented. A Linear Least Squares (LLS) method was
examined next, which makes use of all the data independent
of ordering. The resulting recursive LLS technique works for
LIP models with non-linear states (NLIS) and, therefore, allows
us to examine more flexible and useful gesture models.

The Recursive Linear Least Squares method incrementally
incorporates new data for determining the parameters which will
best fit a set of data points to a given linear model. The
recursive LLS method uses a tuning rule for updating the
parameter vector $\theta$ without inverting a matrix, creating a
more computationally efficient LLS algorithm. A tuning rule
is required, because each block of data will result in a
different set of parameters, as illustrated in FIG. 17. The
separate graphs show that each pair of $(x_i, y_i)$ data points
results in a different best fitting $\theta$ line. A method of
incrementally updating the parameter $\theta$ is described below. The
concept is illustrated in FIG. 18. After the first two data
points determine the best fit line, each additional data point
slightly adjusts the line to a new best fit. Each new data point
will shift the line less and less due to the weighting auxiliary
equation in the recursive LLS method. The formulation
below describes how the weighting function operates.

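The recursive (incremental) Linear Least Squares tuning
method proceeds as follows. The tuning rule has the form:

$$\theta_{m+1} = g(x_{m+1}, \dot{x}_{m+1}, \theta_m)$$

Suppose we have the output data $\dot{x}$ and state data $x$ up to time
$m$, and from this data we have already determined the best
parameters $\theta_m$ for the set. Let $f_k = f(x_k)$ denote the row vector
of model terms evaluated at sample $k$, so that $\dot{x}_k = f_k \theta$. From
[Cohen 96] we know that at the next time step, with $\dot{x}_{m+1}$ and
$x_{m+1}$:

$$\theta_{m+1} = \left( \sum_{k=1}^{m+1} f_k^T f_k \right)^{-1} \sum_{k=1}^{m+1} f_k^T \dot{x}_k$$

Define

$$R_{m+1} = \sum_{k=1}^{m+1} f_k^T f_k$$

Then:

$$R_{m+1} = \sum_{k=1}^{m} f_k^T f_k + f_{m+1}^T f_{m+1} = R_m + f_{m+1}^T f_{m+1}$$

which implies:

$$R_m = R_{m+1} - f_{m+1}^T f_{m+1}$$

Therefore:

$$\begin{aligned}
\theta_{m+1} &= R_{m+1}^{-1} \sum_{k=1}^{m+1} f_k^T \dot{x}_k \\
&= R_{m+1}^{-1} \left( \sum_{k=1}^{m} f_k^T \dot{x}_k + f_{m+1}^T \dot{x}_{m+1} \right) \\
&= R_{m+1}^{-1} \left( R_m \theta_m + f_{m+1}^T \dot{x}_{m+1} \right) \\
&= R_{m+1}^{-1} \left( \left( R_{m+1} - f_{m+1}^T f_{m+1} \right) \theta_m + f_{m+1}^T \dot{x}_{m+1} \right) \\
&= \theta_m - R_{m+1}^{-1} f_{m+1}^T \left( f_{m+1} \theta_m - \dot{x}_{m+1} \right)
\end{aligned}$$

This is an update law for the $R_{m+1}$ and $\theta_{m+1}$ terms. We still
have to find the inverse of $R_{m+1}$ at each time step.
Fortunately, the matrix inversion lemma yields:

$$\left( R_m + f_{m+1}^T f_{m+1} \right)^{-1} = R_m^{-1} - R_m^{-1} f_{m+1}^T \left( f_{m+1} R_m^{-1} f_{m+1}^T + 1 \right)^{-1} f_{m+1} R_m^{-1}$$

Therefore:

$$R_{m+1}^{-1} = R_m^{-1} - R_m^{-1} f_{m+1}^T \left( f_{m+1} R_m^{-1} f_{m+1}^T + 1 \right)^{-1} f_{m+1} R_m^{-1}$$
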
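The above equation is a recursive formula for $R_{m+1}^{-1}$ that is
not based on taking the inverse of a matrix. The initial value
of $R_0$ is chosen as the identity matrix. If more importance is
attached to recent data than to data received in the remote
past, then we can choose $\theta_m$ to minimize:

$$\sum_{k=0}^{m} \lambda^{m-k} \left( \dot{x}_k - f_k \theta_m \right)^2$$

where $\lambda$ is termed the forgetting factor and is chosen with
$0 < \lambda < 1$. This results in:

$$\theta_{m+1} = \theta_m + R_{m+1}^{-1} f_{m+1}^T \left( \dot{x}_{m+1} - f_{m+1} \theta_m \right)$$

$$R_{m+1}^{-1} = \frac{1}{\lambda} R_m^{-1} - \frac{1}{\lambda} R_m^{-1} f_{m+1}^T \left( f_{m+1} R_m^{-1} f_{m+1}^T + \lambda \right)^{-1} f_{m+1} R_m^{-1}$$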

The above recursive equation is the identifier in our
gesture recognition system. This identifier allows us to
represent gestures using an LIP/NLIS model, with the parameters
identified using an on-line, computationally efficient, data
order independent technique. We now determine the specific
model used to represent oscillatory motion gestures.
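
The recursive identifier can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation of the specification: the class name `RecursiveLeastSquares`, its `update` method, and the `forgetting` argument are hypothetical, the output is taken to be a scalar (one acceleration sample), and the stored matrix `p` plays the role of the inverse of R, initialized to the identity as in the text.

```python
class RecursiveLeastSquares:
    """Recursive linear least squares for a scalar output xdot = f . theta."""

    def __init__(self, n_params, forgetting=1.0):
        if not 0.0 < forgetting <= 1.0:
            raise ValueError("forgetting factor must be in (0, 1]")
        self.lam = forgetting
        self.theta = [0.0] * n_params
        # p holds the inverse of R; start from the identity matrix.
        self.p = [[1.0 if i == j else 0.0 for j in range(n_params)]
                  for i in range(n_params)]

    def update(self, f, xdot):
        """Fold in one sample: f is the list of model terms, xdot the output."""
        n = len(self.theta)
        pf = [sum(self.p[i][j] * f[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(f[i] * pf[i] for i in range(n))
        gain = [v / denom for v in pf]
        error = xdot - sum(f[i] * self.theta[i] for i in range(n))
        self.theta = [self.theta[i] + gain[i] * error for i in range(n)]
        # Matrix inversion lemma: no explicit matrix inverse is taken.
        self.p = [[(self.p[i][j] - gain[i] * pf[j]) / self.lam
                   for j in range(n)] for i in range(n)]
        return self.theta


# Illustrative data: samples of xdot = -4*x1 + 2, with model terms f = [x1, 1].
import math

rls = RecursiveLeastSquares(n_params=2)
for k in range(200):
    x1 = math.sin(0.1 * k)
    rls.update([x1, 1.0], -4.0 * x1 + 2.0)
```

After the loop, `rls.theta` is close to the generating values (-4, 2). The small remaining offset comes from starting with the identity matrix rather than from the data.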

Given that we modeled gestures using an LIP/NLIS
representation, the following process was used to determine
the appropriate model. For the first step, we created phase-plane
plots of the gestures to be modeled, as illustrated in the
last plot in FIG. 14. A term in a differential equation model
was composed of a parameter associated with combinations
of multiplied state variables of various powers, that is, of the
form $\theta_i x_1^j x_2^k$. An example model (of a one dimensional
motion) is:

$$\dot{x}_1 = x_2$$

$$\dot{x}_2 = \theta_1 x_1 + \theta_2$$

Intuition was used to "guess" appropriate models that would
best match the phase plane motions. Because we believed an
acceleration model will sufficiently characterize the gestures
in which we are interested, the $\dot{x}_2$ equation is the one
modified with additional terms and parameters. For each
model, the specific parameters for each gesture in the
lexicon were computed using the LLS method.

The models were tested in simulation by measuring how
well each tuned parameter model can predict the future
states of its associated gesture (i.e., by computing a total
residual error). The model which best discriminates between
gestures was then chosen. If none of the models clearly
discriminated between different gestures in a lexicon, then
new models were tested. The heuristic we used was to add
or delete specific terms, and determine if there was a
significant change (good or bad) in the model's ability to
discriminate gestures.

Adding two specific terms to the above equation, that is, 
using the new model 

results in a model that is better able to discriminate between
gestures.

The results of the process of modeling oscillating circles
and lines are detailed in the remaining parts of this section.
This process is also applicable to the determination of an
appropriate model to classify certain non-linear gestures.

A variety of linear-in-parameter models for good circle
and line gesture representations were tested. As before, each
model represented only one dimension of motion, which
was expanded to two or three for actual gesture recognition
(i.e. an oscillating circle or line is formed when two or
three of these decoupled models are present, one for each
planar motion dimension). Again, $x_1$ is the position state,
and $x_2$ is the velocity state. Five of these models are shown
below. The determination of such models illustrates how a
new (and more comprehensive) model could be determined
when required for more complex dynamic motions.

To use the models described here on a digital computer,
a fourth-order Runge-Kutta integration method was used.
Simulations showed that a sampling rate of 10 Hz gives a step
size small enough to allow the use of this method. The Linear
with Offset Component model is the most basic second order
linear system. The offset component allows the model to
represent gestures that are offset from the center of the image
plane. It contains two parameters and is of the form:

$$\dot{x}_1 = x_2$$

$$\dot{x}_2 = \theta_1 x_1 + \theta_2$$

The Van der Pol equation is a slightly non-linear system,
containing three parameters. The $\theta_2$ and $\theta_3$ parameters are
attached to damping terms. This system is of the form:

$$\dot{x}_1 = x_2$$

$$\dot{x}_2 = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_2 x_1^2$$

An offset component is added to the Van der Pol equation in
this system. This system has four parameters and is of the
form:

$$\dot{x}_1 = x_2$$

$$\dot{x}_2 = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_2 x_1^2 + \theta_4$$

A more non-linear system than the Van der Pol equations, the
Higher Order Terms system contains additional spring-like
components. This system has six parameters and is of the
form:

The Velocity Damping Terms system has additional damping
terms. It contains eight parameters and is of the form:

The use of simulations to determine the best gesture
model for representing oscillating circles and lines is now
detailed. We first detail the residual measure calculation.
Next the use of the residual measure to determine the best
gesture model is described.

A predictor bin is composed of a model with parameters
tuned to represent a specific gesture. The role of a bin is to
determine a gesture's future position and velocity based on
its current state. To measure the accuracy of the bin's
prediction, we compared it to the next position and velocity
of the gesture. The difference between the bin's prediction
and the next gesture state is called the residual error. A bin
predicting the future state of a gesture it represents will have
a smaller residual error than a bin predicting the future state
of a gesture it does not represent.

The computation for the residual error is based on the
equation:

$$\dot{x}_k = f_k \theta$$

Recall that $f(x)$ is a two-dimensional vector representing the
gesture's position and velocity. Therefore $\dot{x}_k$ is the gesture's
velocity and acceleration at sample $k$. We compute $\dot{x}_k$ from
the gesture's current and previous position and velocity. The
parameter vector $\hat{\theta}$ is used to seed the predictor bin. Then:

$$\hat{\dot{x}}_k = f_k \hat{\theta}$$

The residual error is then defined as the normalized difference
between the actual value of $\dot{x}_k$ and the calculated
value of $\hat{\dot{x}}_k$:

$$\text{res\_err} = \frac{\lVert \dot{x}_k - \hat{\dot{x}}_k \rVert}{\lVert \dot{x}_k \rVert}$$

FIG. 20 illustrates this concept. Consider the gesture at a
given velocity and acceleration, sample k. At sample k+1,
the predictions from each bin and the actual velocity and
acceleration values are shown. The difference between a
bin's predicted values and the gesture's actual values
(according to the equation above) is the residual error for that
particular bin.

The total residual error is the res_err summed for all data
samples. The following section presents the residual calcu- 
lation for each gesture with respect to each of the computed 
parameters. 
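
The residual calculation can be sketched as below. This is an illustration under assumptions rather than the implementation of the specification: the function names are hypothetical, the bin uses the Linear with Offset Component model, the norm is taken to be Euclidean, and the sample data is generated analytically instead of being measured.

```python
import math


def predict_derivative(state, theta):
    """Bin prediction of (velocity, acceleration) for the Linear with
    Offset Component model: x1' = x2, x2' = theta1*x1 + theta2."""
    x1, x2 = state
    return (x2, theta[0] * x1 + theta[1])


def residual_error(actual, predicted):
    """Normalized difference between actual and predicted derivatives."""
    num = math.hypot(actual[0] - predicted[0], actual[1] - predicted[1])
    den = math.hypot(actual[0], actual[1])
    return num / den if den > 0.0 else 0.0


def total_residual_error(states, derivatives, theta):
    """Sum of the per-sample residual errors for one predictor bin."""
    return sum(residual_error(d, predict_derivative(s, theta))
               for s, d in zip(states, derivatives))


# Illustrative gesture: x1 = cos(2t), so x2 = -2 sin(2t) and x2' = -4*x1.
states = [(math.cos(0.1 * k), -2.0 * math.sin(0.1 * k)) for k in range(100)]
derivatives = [(x2, -4.0 * x1) for x1, x2 in states]
matching_bin = total_residual_error(states, derivatives, (-4.0, 0.0))
other_bin = total_residual_error(states, derivatives, (-1.0, 0.0))
```

Here `matching_bin` is zero because the bin's parameters generated the data, while `other_bin` is much larger, which is the behavior the residual measure relies on.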

We now detail how we determined which parameterization
model for the predictor bin would best differentiate
gestures. A data set of positions and velocities of gestures is
required to test each model. Using a vision system, data was
recorded for a slow, medium, and fast circular gesture. The
data is the x and y position and velocity measurements from
the image plane of the vision system, although for these
simulations only one of the dimensions is used. There is a
small transition time when a human begins a gesture. This
transient is usually less than a second long, but the residual
error measurement has no meaning during this time.
Therefore, gestures that last at least five seconds are used.
The data recorded from a human gesture is termed "real
gesture data."

The total residual error was calculated by subjecting each
predictor bin to each gesture type. A measure of a model's
usefulness is determined by examining the ratio of the
next lowest residual error to the lowest residual error in each
column. The worst "residual error ratio" is the smallest ratio
from all the columns, because it is easier to classify a gesture
when the ratio is large.



                    gesture input
               slow      medium    fast
slow bin       1.31      1.20      1.37
medium bin     14.1      0.24      1.01
fast bin       424       23.1      0.23



The residual error results of the Linear with Offset Component
model are shown in the table above. The residual errors for the
slow and medium gestures, with respect to their associated
bins, are an order of magnitude lower than the other errors
in their columns. The residual error of the fast gesture, with
respect to the fast gesture bin, is one-fourth the size of the
closest residual error in its column (the medium gesture bin).
Therefore, the Linear with Offset Component system is a
good candidate for a gesture model.
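
As a worked example of the ratio measure, the sketch below computes the worst residual error ratio for the Linear with Offset Component table. It rests on two assumptions: that the rows of the table are predictor bins and the columns are gesture inputs, and that the ratio is taken as the next lowest error divided by the lowest, so that a larger ratio means easier classification. The function name `worst_residual_ratio` is illustrative.

```python
def worst_residual_ratio(table):
    """table[bin][gesture] -> total residual error.

    For each gesture column, divide the next lowest error by the lowest;
    the smallest such ratio over all columns is the worst case.
    """
    gestures = list(next(iter(table.values())))
    ratios = []
    for gesture in gestures:
        column = sorted(table[bin_name][gesture] for bin_name in table)
        ratios.append(column[1] / column[0])
    return min(ratios)


# Values copied from the Linear with Offset Component table.
linear_with_offset = {
    "slow bin":   {"slow": 1.31, "medium": 1.20, "fast": 1.37},
    "medium bin": {"slow": 14.1, "medium": 0.24, "fast": 1.01},
    "fast bin":   {"slow": 424.0, "medium": 23.1, "fast": 0.23},
}
worst = worst_residual_ratio(linear_with_offset)
```

Under these assumptions the worst column is the fast gesture, where 1.01 / 0.23 gives a ratio of roughly 4.4.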



                    gesture input
               slow      medium    fast
slow bin       1.34      1.26      1.38
medium bin     9.8       0.56      1.17
fast bin       36        1.79      0.1



As seen above, the Van der Pol model is only a fair candidate 
for gesture discrimination. The residual error of the medium 
gesture with respect to its gesture bin is only two-fifths
smaller than the residual error with respect to the slow 
gesture bin. Also, the residual errors in the slow gesture 
column are not an order of magnitude apart. 



                    gesture input
               slow      medium    fast
slow bin       1.3       1.21      1.37
medium bin     14.5      0.22      0.98
fast bin       464       25.7      0.11



The Van der Pol with Offset Component model is better at 
discriminating gestures than the model without the offset 
term (see table above). The residual errors in the medium 
gesture's column are now an order of magnitude apart. 
Although the residual errors in the fast gesture's column are 
not, the discrimination is still slightly better than in the 
Linear with Offset Component model. 



                    gesture input
               slow      medium    fast
slow bin       1.29      1.24      1.37
medium bin     14.6      0.18      1.03
fast bin       249       20.0      0.11



The table above shows the residual errors associated with 
the Higher Order model. This model is an improvement over 
the Van der Pol with Offset Component model, as the 
residual errors in the fast gesture's column are now almost 
an order of magnitude apart. 







                    gesture input
               slow      medium    fast
slow bin       1.28      136       23.3
medium bin     13.8      0.17      1
fast bin       8770      35.9      0.09



The table above lists the residual errors for the Velocity
Damping model. This is the best model for discriminating 
between gestures, as the residual errors for each gesture with 
respect to their tuned bins are all at least an order of 
magnitude below the other residual errors in their columns. 



A comparison of the worst "residual error ratio" of each 
model we considered is summarized in FIG. 21, and sug- 
gests that the Velocity Damping model is the best choice for 
our application. However, the technique described here 
shows how more models could be derived and tested. For 
simple dynamic gesture applications, the Linear with Offset 
Component model would be used. For more complex 
gestures, a variation of the Velocity Damping model would 
be used. 

Combining One-Dimensional Motions to Form Higher-Dimensional Gestures

We have shown how predictors can be used to recognize
one-dimensional oscillatory motions. Recognition of higher
dimensional motions is achieved by independently recognizing
multiple, simultaneously created one dimensional
motions. For example, the combination of two oscillatory
line motions performed along perpendicular axes can give rise to
circular planar gestures, as shown in FIG. 22.

Humans have the ability to create these planar motions.
However, they can also make these motions in all three
dimensions (for example, circles generated around different
axes). To recognize these planar gestures performed in
three-dimensional space, a vision system must be able to
track a gesture's position through all three physical dimensions.
A binocular vision system has this capability, as does
a monocular system with an attached laser range finder. Any
such vision system can be used with our gesture
recognition system to identify three dimensional gestures.
Development of a System to Recognize Static Gestures 

Recognizing static hand gestures can be divided into 
localizing the hand from the rest of the image, describing the 
hand, and identifying that description. The module to rec- 
ognize static hand gestures is to be both accurate and 
efficient. A time intensive process of evaluating hand ges- 
tures would prevent the system from updating and following 
motions which occur in real time. The system is intended to 
interact with people at a natural pace. Another important 
consideration is that the background may be cluttered with 
irrelevant objects. The algorithm should start at the hand and 
localize the hand from the surroundings. 

In order to meet these demands, the edges of the image are 
found with a Sobel operator. This is a very fast linear 
operation which finds approximations to the vertical and 
horizontal derivatives. In order to use only a single image, 
the greater of the horizontal and vertical component is kept 
as the value for each pixel. Besides being quick to calculate, 
an edge image avoids problems arising from attempting to 
define a region by locating consistent intensity values or 
even consistent changes in intensity. These values can vary 
dramatically in one hand and can be very hard to distinguish 
from the background as well. 
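
A minimal sketch of the edge step follows, written with plain lists so that it has no dependencies. The function name `sobel_edges` is illustrative, the kernels are the standard 3x3 Sobel kernels, and keeping the greater of the two absolute responses is an assumption about how "the greater of the horizontal and vertical component" is applied; border pixels are simply left at zero.

```python
def sobel_edges(image):
    """Return an edge image the same size as `image` (rows of gray values).

    Each interior pixel gets the larger of the absolute horizontal and
    vertical Sobel responses; border pixels stay 0.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal derivative: right column minus left column.
            gx = (image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1])
            # Vertical derivative: bottom row minus top row.
            gy = (image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1])
            edges[r][c] = max(abs(gx), abs(gy))
    return edges


# Illustrative image: a vertical step from dark (0) to bright (100).
step = [[0, 0, 100, 100, 100] for _ in range(5)]
edges = sobel_edges(step)
```

On this step image the response is strong on the two columns next to the intensity jump and zero in the uniform region to the right of it.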

In order to describe the hand, a box which tightly encloses
the hand is first found. This allows a consistent description
which is tolerant to changes in scale. To locate this box, we
assume a point within the hand is given as a starting point.
This is reasonable because the hand will be the principal
moving object in the scene. Moving objects may be easily
separated and the center of the largest moving area will be
in the hand. From this starting point, a prospective box edge
is drawn. If this box edge intersects an existing line, it must
be expanded. Each side is tested in a spiral pattern of
increasing radius from the initial center point. Once three
sides have ceased expanding, the last side is halted as well.
Otherwise, the last side would often crawl up the length of
the arm. The bounding box is shown in FIG. 23.

Once the hand has been isolated with a bounding box, the
hand is described (FIG. 24). This description is meant to be
scale invariant as the size of the hand can vary in each
camera image. At regular intervals along each edge the
distance from the bounding edge to the hand's outline is
measured. This provides a consistent description which may
be rapidly calculated. A description is a vector of the
measured distances, allowing a very concise representation.
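
One way such a description vector could be computed is sketched below. Everything specific here is an assumption made for illustration: the function name `describe_hand`, the binary `mask` input (1 where an edge pixel was found), the box given as inclusive `(top, left, bottom, right)` coordinates, the number of samples per side, and the division by the box size to obtain scale invariance.

```python
def describe_hand(mask, box, samples_per_side=4):
    """Distances from each bounding-box edge inward to the first outline
    pixel, sampled at regular intervals and normalized by the box size."""
    top, left, bottom, right = box
    height, width = bottom - top + 1, right - left + 1

    def first_hit(cells):
        # Number of steps from the box edge to the first set pixel.
        for dist, (r, c) in enumerate(cells):
            if mask[r][c]:
                return dist
        return len(cells)

    def sample_points(start, length):
        # Centre of each of `samples_per_side` equal segments.
        return [start + int((i + 0.5) * length / samples_per_side)
                for i in range(samples_per_side)]

    rows_down = range(top, bottom + 1)
    rows_up = range(bottom, top - 1, -1)
    cols_right = range(left, right + 1)
    cols_left = range(right, left - 1, -1)

    description = []
    for c in sample_points(left, width):      # from the top edge
        description.append(first_hit([(r, c) for r in rows_down]) / height)
    for c in sample_points(left, width):      # from the bottom edge
        description.append(first_hit([(r, c) for r in rows_up]) / height)
    for r in sample_points(top, height):      # from the left edge
        description.append(first_hit([(r, c) for c in cols_right]) / width)
    for r in sample_points(top, height):      # from the right edge
        description.append(first_hit([(r, c) for c in cols_left]) / width)
    return description


# Illustrative shapes: the same square outline at two scales.
small = [[1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(4)]
         for r in range(4)]
big = [[1 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)]
       for r in range(8)]
small_desc = describe_hand(small, (0, 0, 3, 3), samples_per_side=2)
big_desc = describe_hand(big, (0, 0, 7, 7), samples_per_side=2)
```

The two descriptions come out identical even though one shape is twice the size of the other, which is the property the normalization is meant to provide.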

The last task of the static gesture recognition is to identify
the new description. A simple nearest neighbor metric is
used to choose an identification. A file of recognized gestures
is loaded in the initialization of the program. This file
consists of a list of named gestures and their vector descriptions.
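
The identification step can be sketched as a nearest neighbor lookup. The text does not say which distance is used, so Euclidean distance is an assumption here, as are the function name `identify_gesture`, the in-memory dictionary standing in for the gesture file, and the example vectors.

```python
import math


def identify_gesture(description, known_gestures):
    """Return the name of the stored gesture whose description vector is
    closest (Euclidean distance) to the new description."""
    return min(known_gestures,
               key=lambda name: math.dist(description, known_gestures[name]))


# Illustrative lexicon of named gestures and their description vectors.
known = {
    "stop": [0.0, 0.0, 0.0, 0.0],
    "okay": [1.0, 1.0, 0.0, 0.0],
}
match = identify_gesture([0.9, 1.0, 0.1, 0.0], known)
```

With these vectors the new description is matched to "okay". A linear scan like this costs one distance computation per stored gesture, which is inexpensive for a small lexicon.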

Considerations 

The primary obstacle in static gesture recognition is
locating and separating the hand from the surroundings.
Using sophisticated models of the hand or human body to
identify it within an image is computationally expensive. If
orientation and scale are not very constrained, this cannot be
done in real time. Our system makes descriptions quickly
and can compare them to predefined models quickly.

The limitations of the current system are a result of being
dependent on the fast edge finding techniques. If lighting is
highly directional, parts of the hand may be placed in
shadow. This can cause odd, irregular lines to be found and
defeat the normal description. If the background immediately
surrounding the hand is cluttered with strongly contrasting
areas, these unrelated lines may be grouped with the
hand. This also causes unpredictable and unreliable descriptions.
Such a background is very difficult to separate without
making assumptions about the hand color or the size of the
hand. An upper and lower bound are placed on the size of the
hand in the image, but these permit a wide range of distances
to the camera and are needed to assure that enough of the
hand exists in the image to make a reasonable description.

As long as the hand is within the size bounds (more than
a speck of three pixels and less than the entire field of view)
and the immediate surroundings are fairly uniform, any hand
gesture may be quickly and reliably recognized.

Multiple camera views can be used to further refine the
identification of static gestures. The best overall match from
both views would be used to define and identify the static
gestures. Furthermore, the system works not just for "hand"
gestures, but for any static type of gestures, including foot,
limb, and full body gestures.

The Overall Gesture Recognition System 

In this section, based on the discussed functional and
representational issues, we detail the specific components of
a dynamic gesture recognition system according to the
invention from an architectural and implementational viewpoint.
In the preferred embodiment, the system is composed
of five modules. FIG. 3 illustrates the signal flow of the
gestural recognition and control system, from gesture
creation, sensing, identification, and transformation into a
system response.
Gesture Creator 

In the Gesture Creator module, a human or device creates
a spatial motion to be recognized by the sensor module. Our
gesture recognition system was designed to recognize consistent
yet non-perfect motion gestures and non-moving
static gestures. Therefore, a human as well as a device can
create gestures that are recognizable by the
system. Human gestures are more difficult to recognize due
to the wide range of motions that humans recognize as the
same gesture. We designed our gesture recognition system to
recognize simple Lissajous gesture motions (repeating
circles and lines), advanced motions such as "come here"
and "go there", and static hand symbols (such as "thumbs-up").
Dynamic Gesture Lexicon

A gesture lexicon is a set of gestures used for communi- 
cation or device control. We chose gestures for our lexicon 
based on the following: 

Humans should be able to make the gestures easily.
Device gestures in the form of repeated motions should be
modeled the same as human gestures.
The gestures should be easily represented as a dynamic
system.
The lexicon should match useful gestures found in real
world environments.

The dynamic gestures used in this system are preferably 
based upon three one-dimensional oscillations, performed 
simultaneously in three dimensions (or two oscillations 
performed in two dimensions). A circle is such a motion, 
created by combining repeating motions in two dimensions 
that have the same magnitude and frequency of oscillation, 
but with the individual motions ninety degrees out of phase. 
A "diagonal" line is another such motion. To illustrate this, 
we define three distinct circular gestures in terms of their 
frequency rates: slow, medium, and fast. Humans create 
gestures that we define as slow large circles (slow), fast large 
circles (medium), and fast small circles (fast). More com- 
plex gestures can be generated and recognized, but these 
simple ones are used for illustrative purposes. 
Main Three Gestures 

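Using the simpler Linear with Offset model (whose
parameters are easier to understand than the more complex
models), we represented a circle by two second order
equations, one for each axis:

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = \theta_1 x_1 + \theta_2$$

and

$$\dot{y}_1 = y_2, \qquad \dot{y}_2 = \theta_1 y_1 + \theta_2$$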

The preferred gesture model has no "size" parameter. $\theta_1$ is
a frequency measure, and $\theta_2$ is a drift component. The
gestures were named "large", "small", "fast", and "slow"
due to the human motions used to determine the parameters
(see FIG. 25). A fast small circle is used to represent a fast
oscillation because humans cannot make fast oscillations
using large circles. Models with higher order terms would
have parameters with different representations.
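
To make the two-equation circle concrete, the sketch below integrates the Linear with Offset model on both axes with a fourth-order Runge-Kutta step at 10 Hz. It is an illustration only: the function names are hypothetical, the relation between the frequency measure and the oscillation frequency (the first parameter set to minus the squared angular frequency) follows from the harmonic oscillator form of the model and is an assumption about how the parameter is read, and the drift parameter is set to zero.

```python
def rk4_step(state, theta, h):
    """One fourth-order Runge-Kutta step of x1' = x2, x2' = theta1*x1 + theta2."""
    def deriv(s):
        return (s[1], theta[0] * s[0] + theta[1])

    k1 = deriv(state)
    k2 = deriv((state[0] + 0.5 * h * k1[0], state[1] + 0.5 * h * k1[1]))
    k3 = deriv((state[0] + 0.5 * h * k2[0], state[1] + 0.5 * h * k2[1]))
    k4 = deriv((state[0] + h * k3[0], state[1] + h * k3[1]))
    return (state[0] + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            state[1] + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)


def simulate_circle(omega, steps, h=0.1):
    """Run the same model on x and y, started 90 degrees out of phase."""
    theta = (-omega ** 2, 0.0)          # frequency measure, zero drift
    x, y = (1.0, 0.0), (0.0, omega)     # (position, velocity) per axis
    path = []
    for _ in range(steps):
        path.append((x[0], y[0]))
        x, y = rk4_step(x, theta, h), rk4_step(y, theta, h)
    return path


# Illustrative run: angular frequency 2 rad/s for 10 seconds at 10 Hz.
path = simulate_circle(omega=2.0, steps=100)
```

Every point of `path` stays very close to the unit circle, showing that two decoupled one-dimensional oscillations with the same parameters trace a circular gesture.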
Expanded Lexicon — Geometric Constraints 

A total of twenty-four gestures are possible from this 
example representation when the following are distinct 
gestures: clockwise and counter-clockwise circles, diagonal 
lines, one dimensional lines, and small and large circles and 
lines. Geometric constraints are required to expand the 
lexicon, because different gestures can result in the same 
parameters. FIG. 4 shows motions that would cause an 
identifier to produce the same frequency measure and drift 
components as it would produce when identifying a slow 
large circle. When x and y oscillating motions are 90 degrees 
out of phase, a clockwise circle is produced. Motions that are 
270 degrees out of phase result in a counter clockwise circle. 
In phase motions produce a line with a positive slope. When 
the motions are 180 degrees out of phase, a line with a 
negative slope is produced. We can create additional ges- 
tures from the fast small circle in the same manner. 

Given the various combinations of slow, fast, small, and
large circles, the only one not used as a gesture is the slow
small circle. Since the slow small circle has the same
oscillation frequency (medium) as the fast large circle, we
need another geometric feature, the circle's size, to differentiate
between these two gestures. As with the previous
gestures, additional gestures can be created from these two
gestures by varying the phase relationships. FIG. 6 shows a
representation of the 24 gestures in this example lexicon.

Phase relationships are determined as follows. During the
gesture, the x's and y's minimum and maximum image
plane positions are computed. If the x and y motions are out
of phase, as in a circle, then when x or y is minimum or
maximum, the other axis's velocity is large. The clockwiseness
of the motion is determined by looking at the sign of
this velocity component. Similarly, if the x and y motions are
in phase, then at these extremum points both velocities are
small. A similar method is used when the gesture is performed
in three dimensions.
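
The extremum test can be sketched as follows. This is a simplified illustration with assumed details: the function name `phase_relationship`, the use of only the x maximum, central differences for the y velocity, and the `threshold` fraction that separates "large" from "small" velocity. The returned sign is the sign of the y velocity at the x maximum; which sign corresponds to clockwise depends on the image axis convention and is not fixed here.

```python
import math


def phase_relationship(xs, ys, dt, threshold=0.5):
    """Classify sampled x/y motion as ("circle", sign) or ("line", 0)."""
    # Central-difference y velocity for samples 1 .. len-2.
    vy = [(ys[k + 1] - ys[k - 1]) / (2.0 * dt) for k in range(1, len(ys) - 1)]
    peak_speed = max(abs(v) for v in vy)
    # Sample at which x reaches its maximum (interior samples only).
    k_max = max(range(1, len(xs) - 1), key=lambda k: xs[k])
    v_at_extremum = vy[k_max - 1]
    if peak_speed == 0.0 or abs(v_at_extremum) < threshold * peak_speed:
        return ("line", 0)      # velocities small at the extremum
    return ("circle", 1 if v_at_extremum > 0 else -1)


# Illustrative motions sampled at 20 Hz for 10 seconds.
dt = 0.05
ts = [k * dt for k in range(200)]
xs = [math.cos(2 * t) for t in ts]
circle = phase_relationship(xs, [math.sin(2 * t) for t in ts], dt)
reverse = phase_relationship(xs, [-math.sin(2 * t) for t in ts], dt)
line = phase_relationship(xs, [math.cos(2 * t) for t in ts], dt)
```

The two out-of-phase motions are classified as circles with opposite signs, and the in-phase motion is classified as a line.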
Sensor Module 

Unmodified Cohu solid-state CCD cameras are used as 
the sensor devices. No filters were used and the background 
was not modified. A Matrox Meteor capture card was used 
to scale a captured image to any size without missing any 
frames. It will capture and transfer full-resolution, full-frame 
NTSC (640x480) or PAL (768x576) video input in real-time 
(30 Hz). 

The color tracking system (CTS) uses the color of the 
hand and its motion to localize the hand in the scene, as 
shown schematically in FIG. 26. The hardware of the CTS 
system consists of a color camera, a frame grabber, and an 
IBM-PC compatible computer. The software consists of the 
image grabbing software and the tracking algorithm. Once 
the CTS is running, the graphical user interface displays the 
live image from the color camera on the computer monitor. 
The operator can then use the mouse to click on the hand in 
the image to select a target for tracking. The system will then 
keep track of the moving target in the scene in real-time. 

The color tracking system is developed on a BSD 4.0 
UNIX operating system. The hardware involved consists of 
a color camera, an image capture board and an IBM PC 
compatible. The software for the CTS is written in C and 
uses Motif for its graphical user interface. 

The present HTS system consists of a COHU 1322 color 
camera with a resolution of 494x768 pixels. The camera is 
connected to a Meteor image capturing board situated inside 
a Pentium-II 450MHz IBM-PC compatible computer. The 
Meteor board is capable of capturing color video images at 
30 frames per second. It is also able to capture these images 
at any resolution below the resolution of the camera. 

The graphical user interface for the CTS displays a live 
color image from the camera on the computer screen. The 
user can then identify the target in the scene and click on it 
using the mouse. The CTS will then track the target in 
real-time. The flow chart of the tracking algorithm is shown 
in FIG. 27. 

We capture the image using functions from the Meteor 
driver. To provide real-time operation, we set up the board to
signal the program using a system interrupt (SIGUSR2). 
Every time a new frame is ready, the Meteor alerts the 
program with an interrupt on this signal. The image capture 
function responds to the interrupt by transferring the current 
camera image to a buffer and processing it to find the target. 
The signal mechanism and its handling are what enable the 
system to operate in real-time. 

The graphical user interface of CTS displays the live
camera image on the screen. The user can start tracking by
clicking the mouse on the target. This starts the tracking
algorithm. The graphical user interface of the CTS is shown
in FIG. 28.

Once the user clicks on the target in the image, we
compute the average color of a small region around this
point in the image. This will be the color of the target region
being tracked in the scene until it is reinitialized. Once
tracking begins, we compute the position of the target region
in the image using two methods. The first method tracks the
target when there is sufficient motion of the target in the
image. The second method will take over when there is no
motion of the target in the scene.

Before choosing the method for finding the target in the
scene, the system checks for motion in a region near the
current or estimated target position using a motion detecting
function. This function computes the difference between the
current image and the previous image, which is stored in
memory. If motion has occurred, the intensities in the region
will change noticeably. The motion detection function will
trigger if a sufficient number of pixels change intensity by
more than a certain threshold value.
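
The motion check can be sketched as follows. This is a minimal Python illustration on grayscale frames stored as nested lists; the function name motion_detected and both threshold values are assumptions chosen for the example, not values from the system described here.

```python
def motion_detected(prev, curr, region, pixel_threshold=20, count_threshold=10):
    """Return True if enough pixels inside `region` changed intensity.

    prev, curr: 2-D lists of grayscale intensities (rows of ints).
    region: (top, left, bottom, right), with bottom and right exclusive.
    Both thresholds are illustrative values.
    """
    top, left, bottom, right = region
    changed = 0
    for r in range(top, bottom):
        for c in range(left, right):
            # A pixel counts as changed when its intensity difference
            # between the two frames exceeds the per-pixel threshold.
            if abs(curr[r][c] - prev[r][c]) > pixel_threshold:
                changed += 1
    return changed >= count_threshold
```

Restricting the loop to a region near the target keeps the cost of the check small compared with differencing the whole frame.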

If the motion detection function detects motion, the next
step is to locate the target. This is done using the difference
image and the target color. When an object moves between
frames in a relatively stationary background, the color of the
pixels changes between frames near the target (unless the
target and the background are of the same color). We
compute the color change between frames for pixels near the
target location. The pixels whose color changes beyond a
threshold make up the difference image. Note that the
difference image will have areas that are complementary.
The pixels where the object used to be will complement
those pixels where the object is now. If we separate these
pixels using the color of the target, we can compute the new
location of the target. The set of pixels in the difference
image that has the color of the target in the new image
will correspond to the leading edge of the target in the new
image. If we assume that the target approximates an ellipse
of known dimensions, we can compute the position of the
center of the target (ellipse) from this difference image (see
FIG. 29).

The color of a pixel in a color image is determined by the
values of the Red, Green and Blue bytes corresponding to
the pixel in the image buffer. This color value will form a
point in the three-dimensional RGB color space (see FIG.
30). For our tracking system, when we compute the average
color of the target, we assume that the target is fairly evenly
colored and the illumination stays relatively the same. The
average color of the target is then the average RGB values
of a sample set of pixels constituting the target. When the
target moves and the illumination changes, the color of the
target is likely to change. The color matching function
allows us to compute whether a pixel color matches the
target color within limits. When the illumination on the
target changes, the intensity of the color will change. This
will appear as a movement along the RGB color vector as
shown in FIG. 30. In order to account for slight variations in
the color, we further allow the point in color space to lie
within a small truncated cone as shown in the figure. Two
thresholds decide the shape of the cone: one for the
angle of the cone and one for the minimum length of the
color vector. Thus, any pixel whose color lies within the
truncated cone in color space will be considered as having
the same color as the target.
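
The truncated-cone test can be sketched as follows. This is a minimal Python illustration under our own assumptions: the function name matches_target_color and the default values of the two thresholds (cone half-angle and minimum vector length) are hypothetical.

```python
import math

def matches_target_color(pixel, target, max_angle_deg=10.0, min_length=30.0):
    """True if `pixel` lies inside the truncated cone around `target` in RGB space.

    pixel, target: (R, G, B) tuples. The cone axis is the target color vector;
    max_angle_deg is its half-angle and min_length cuts off its tip.
    Both defaults are illustrative values.
    """
    length = math.sqrt(sum(p * p for p in pixel))
    target_length = math.sqrt(sum(t * t for t in target))
    if length < min_length or target_length == 0:
        return False  # too dark: the direction of the color vector is unreliable
    cos_angle = sum(p * t for p, t in zip(pixel, target)) / (length * target_length)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg
```

A dimmer pixel of the same hue, such as (100, 50, 25) against a target of (200, 100, 50), lies on the cone axis and therefore matches, which is how the test tolerates a change in illumination.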

When the motion detection function fails to detect
significant motion in the scene, we use a static target matching
function to compute the target's location. The function searches a
small area about the current location of the target to find the
best fit in the image for the target. The search will find the
location of the target with the highest matching value. We
assume that the object is approximately elliptical. The
elliptical target is hypothesized at each point in the search
space and the matching metric is computed. This matching
metric function uses a combination of edge and interior
color matching algorithms to get a single matching number.

The image capture board is capable of providing us with 
a 480x640-pixel color image at 30 frames per second. 
Processing such a large image will slow down the program. 
Fortunately, the nature of the tracking task is such that only
a fraction of the image is of interest. This region, called the
window of interest, lies around the estimated position of the
target in the new image. We can compute the location of the
target in the new image from the location of the target in the
previous image and its velocity. This simple method is able
to keep track of the target even when the target moves
rapidly. We have found that the window of interest is
typically 1/100th the area of the original image. This speeds
up the computation of the new target location considerably.
A computer with a higher processing speed could process 
the entire image without resorting to creating a region of 
interest. 
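
The placement of the window of interest can be sketched as follows. This is a minimal Python illustration under stated assumptions: the image is taken to be 640 pixels wide and 480 high, the window is fixed at 64 by 48 pixels (one hundredth of that area), velocity is expressed in pixels per frame, and the function name is hypothetical.

```python
def window_of_interest(pos, vel, win_w=64, win_h=48, img_w=640, img_h=480):
    """Predict the target position one frame ahead and center a window on it.

    pos: (x, y) in pixels; vel: (vx, vy) in pixels per frame.
    Returns (left, top, right, bottom), clamped to stay inside the image.
    """
    # Constant-velocity prediction of where the target will be next frame.
    px = pos[0] + vel[0]
    py = pos[1] + vel[1]
    left = int(min(max(px - win_w / 2, 0), img_w - win_w))
    top = int(min(max(py - win_h / 2, 0), img_h - win_h))
    return left, top, left + win_w, top + win_h
```

Only the pixels inside the returned rectangle need to be examined, so the per-frame cost scales with the window area rather than the full image area.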

Identification Module 

The gesture recognition algorithms are located in the 
Identification Module. This module uses the position and 
velocity information provided by the sensor module to 
identify the gesture. The module, shown in FIG. 31, comprises
three components: the Dynamic Gesture Prediction module, the
Static Gesture Identification module, and the Overall
Determination module (Which Gesture?). The output of the
Overall Determination module is sent to a screen display and
to the SSM, which produces an output based on the gesture
command received.

The Dynamic Gesture Prediction Module

The Dynamic Gesture Prediction module contains a bank 
of predictor bins (see FIG. 32). Each predictor bin contains 
a dynamic system model with parameters preset to a specific 
gesture. We assumed that the motions of human circular 
gestures are decoupled in x and y. Therefore, there are 
separate predictor bins for the x and y axes. In this example 
of three basic two dimensional gestures, a total of six 
predictor bins are required. The position and velocity infor- 
mation from the sensor module is fed directly into each bin. 

The idea for seeding each bin with different parameters 
was inspired by Narendra and Balakrishnan's work on 
improving the transient response of adaptive control systems.
In this work, they create a bank of indirect controllers which
are tuned online but whose identification models have
different initial estimates of the plant parameters. When the 
plant is identified, the bin that best matches that identifica- 
tion supplies a required control strategy for the system. 

Each bin's model, which has parameters that tune it to a 
specific gesture, is used to predict the future position and 
velocity of the motion. This prediction is made by feeding 
the current state of the motion into the gesture model. This 
prediction is compared to the next position and velocity, and 
a residual error is computed. The bin, for each axis, with the
least residual error is the best gesture match. If the residual
error of the best match is not below a predefined threshold
(which is a measure of how much variation from a specific
gesture is allowed), then the result is ignored; no gesture is
identified. Otherwise, geometric information is used to
constrain the gesture further. A single gesture identification
number, which represents the combination of the best x bin,
the best y bin, and the geometric information, is output to the
transformation module. This number (or NULL if no gesture
is identified) is output immediately upon the initiation of
the gesture and is continually updated.
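
The selection among predictor bins for one axis can be sketched as follows. This is a minimal Python illustration, not the implementation itself. It assumes that each bin's model has the form acceleration = theta1 * position + theta2, which is our reading of the two parameters listed per axis; the Euler time step, the residual measure, the threshold, and all function names are likewise assumptions. The x-axis parameter values are those given in the parameter table.

```python
import math

# Per-bin (theta1, theta2) for the x axis, taken from the parameter table.
X_BINS = {"slow": (-0.72, 149.0), "medium": (-16.2, 3467.0), "fast": (-99.3, 20384.0)}

def predict(state, theta, dt):
    """One Euler step of the assumed model: acceleration = theta1 * position + theta2."""
    pos, vel = state
    acc = theta[0] * pos + theta[1]
    return pos + vel * dt, vel + acc * dt

def best_bin(states, bins, dt, threshold):
    """Return the bin name with the smallest mean residual, or None.

    states: sequence of (position, velocity) samples for one axis.
    None is returned when even the best bin's residual is not below `threshold`.
    """
    residuals = {}
    for name, theta in bins.items():
        total = 0.0
        for current, nxt in zip(states, states[1:]):
            p, v = predict(current, theta, dt)
            # Residual: distance between predicted and observed next state.
            total += math.hypot(p - nxt[0], v - nxt[1])
        residuals[name] = total / (len(states) - 1)
    name = min(residuals, key=residuals.get)
    return name if residuals[name] < threshold else None
```

Running the same function on the y-axis samples with the y-axis parameters gives the second bin index; the pair of results is what the geometric constraint then refines.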

Determining Parameter Values

The parameters used to initially seed each predictor bin
were calculated by feeding the data of each axis from the
three example basic gestures into the recursive linear least
squares method. The values for each bin are summarized in
the following table:

Parameter Values

               x-theta-1   x-theta-2   y-theta-1   y-theta-2
slow bin           -0.72         149       -0.73         103
medium bin         -16.2        3467       -16.3        2348
fast bin           -99.3       20384       -97.1       12970
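
A recursive linear least squares fit of two parameters for one axis can be sketched as follows. This is a minimal Python illustration using the standard recursive least squares update; it assumes the regression acceleration = theta1 * position + theta2, and the function name rls_fit and the initial covariance p0 are hypothetical choices.

```python
def rls_fit(samples, p0=1e6):
    """Recursive least squares for acc = theta1 * pos + theta2.

    samples: iterable of (pos, acc) pairs, processed one at a time.
    Returns the final estimate (theta1, theta2).
    """
    theta = [0.0, 0.0]
    P = [[p0, 0.0], [0.0, p0]]  # large initial covariance: little trust in theta
    for pos, acc in samples:
        phi = (pos, 1.0)  # regressor for this data point
        Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = 1.0 + phi[0] * Pphi[0] + phi[1] * Pphi[1]
        gain = [Pphi[0] / denom, Pphi[1] / denom]
        error = acc - (theta[0] * phi[0] + theta[1] * phi[1])
        theta = [theta[0] + gain[0] * error, theta[1] + gain[1] * error]
        # P <- P - gain * (phi^T P); P is symmetric, so phi^T P equals Pphi.
        P = [[P[i][j] - gain[i] * Pphi[j] for j in range(2)] for i in range(2)]
    return tuple(theta)
```

Each new data point refines the estimate without reprocessing earlier points, which is why the method suits data arriving frame by frame.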



The Static Gesture Identification Module 

The Static Gesture Identification module only searches for 
static gestures when the hand motion is very slow (i.e. the 
norm of the x and y velocities is below a threshold amount). 
When this happens, the module continually identifies a static 
gesture or outputs that no gesture was found. 
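
The velocity gate for the static search can be sketched as follows. This is a minimal Python illustration; the function name and the threshold value (in pixels per frame) are assumptions chosen for the example.

```python
import math

def should_search_static(vx, vy, speed_threshold=2.0):
    """True when the hand is slow enough for the static module to search.

    vx, vy: x and y velocities of the hand; the norm of the two is compared
    with speed_threshold, which is an illustrative value.
    """
    return math.hypot(vx, vy) < speed_threshold
```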

The static gestures may be easily expanded by writing 
new gesture descriptions to a configuration file. Each gesture 
is described by a name tag, width, height, x location, y 
location, base side, and three vectors (in this example, each 
consisting of 15 integers) describing the profile of the hand. 
Because profiles may be significantly different due to vary- 
ing tilts of the hand, multiple descriptions of fundamentally 
the same gesture may be desired. The initial or last line may 
also be less reliable due to missing the contours of the hand 
edge image. 

Example parameter files are depicted in the following
table:

Parameters for Halt

name: halt arm: 14 width: 32 height: 47 xloc: -1 yloc: -1
4 4 0 0 0 0 0 0 0 0 0 0 6 8 10
9 8 8 7 4 3 3 3 2 2 1 1 1 1 2
17 17 16 12 11 10 10 9 8 1 1 2 4 6 9

Parameters for Turn Right

name: go_right arm: 11 width: 47 height: 31 xloc: -1 yloc: 0
47 27 26 23 8 5 1 1 1 23 4 19 12 14 21
31 11 9 7 10 10 9 10 5 2 1 5 8 10 13
31 14 10 10 6 5 4 3 2 3 2 1 1 1 2

Parameters for Acknowledge

name: acknowledge arm: 11 width: 38 height: 46 xloc: 0 yloc: 0
38 6 6 8 11 12 10 3 2 1 3 3 9 6 12
46 23 20 3 1 4 7 2 13 16 17 19 21 22 24
46 17 11 2 1 1 2 2 7 3 3 3 4 7 7

Parameters for Freeze (fist)

name: freeze arm: 14 width: 27 height: 29 xloc: -1 yloc: -1
0 0 0 4 6 6 3 2 2 2 3 6 7 0 8
27 12 12 4 4 3 3 3 2 2 2 1 1 1 1
27 14 14 13 13 13 4 2 2 2 3 3 1 2 3



In each, the name string is followed by an arm side, width,
height, x location and y location. The arm parameter is
simply an integer corresponding to above, below, right, or
left. The width and height are measured in pixels. The x and
y location are 0 if the location is not important, or +1 or -1
to restrict recognition of a gesture to one particular quadrant.
The following three vectors are the extreme side (the end of
the hand), then the top or left side, followed by the bottom or
right side. Which side each vector represents is determined
by the arm side parameter. For example, if the base side is
from below (as in the Halt gesture), the first line is from
above, then from the left, then from the right. Right and left
refer to the overall image, not the facing of the imaged person.
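
Reading one gesture description from such a configuration file can be sketched as follows. This is a minimal Python illustration: the function name parse_gesture and the dictionary layout are hypothetical, and the sample is the Turn Right entry, with the fused "23" in its third vector read as "2 3" so that each vector holds 15 integers (an assumption about the garbled original).

```python
import re

def parse_gesture(text):
    """Parse one gesture description.

    The first non-empty line holds `key: value` pairs (name, arm, width,
    height, xloc, yloc); the next three lines are the profile vectors.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = dict(re.findall(r"(\w+):\s*(\S+)", lines[0]))
    gesture = {"name": header.pop("name")}
    # Every remaining header field is an integer (arm side, size, quadrant).
    gesture.update({key: int(value) for key, value in header.items()})
    gesture["profiles"] = [[int(tok) for tok in ln.split()] for ln in lines[1:4]]
    return gesture

# Turn Right entry; third vector re-spaced to 15 integers (assumption).
SAMPLE = """
name: go_right arm: 11 width: 47 height: 31 xloc: -1 yloc: 0
47 27 26 23 8 5 1 1 1 23 4 19 12 14 21
31 11 9 7 10 10 9 10 5 2 1 5 8 10 13
31 14 10 10 6 5 4 3 2 3 2 1 1 1 2
"""
gesture = parse_gesture(SAMPLE)
```

Keeping the arm side as a plain integer avoids guessing which values stand for above, below, right, or left, since that mapping is not spelled out here.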

Another method used for this part is to parameterize each
part of the hand (palm, digits, and wrist) as a set of
connected "blobs", that is, three dimensional shapes which
are connected together geometrically. As before, a
configuration file would be used to define how these blobs are
connected, with the vision system identifying the blobs
which this module sticks together.

The Overall Determination Module

This "Which Gesture?" module takes input from both the
Static and Dynamic Gesture modules. When the velocity is
small, then a static gesture is observed. When the velocity is
greater than a threshold value, then a dynamic gesture is
observed. The gesture identified is continuously output,
and can therefore change value over time (the output can
even be that no gesture was identified). The gesture
identified is sent to the transformation module.

Transformation Module

The transformation module takes a gesture type as its input
and determines what to do with it. In the case of this system,
the gesture is converted to parameters which represent the
static or dynamic gesture, which are sent to the system which
uses this information to produce a response.

System Response

The gesture command can be used for a wide variety of
purposes. These include:

Commands into a virtual reality simulator, to control and
interact with the environment.

Commands for a self service machine (SSM), such as a
public information kiosk or Automated Teller Machines.

Commands to control an actuated mechanism, such as a
robot arm or mobile robot.

Commands to control any device (such as a home
appliance).

It is important to note that these devices can be controlled
using static gestures, dynamic gestures, or a combination of
the two. Thus, there is more information available to these
systems from the gesture input device, thereby allowing for
a greater ability for humans to command and control them.

The key features of our architecture are the prediction
modules and the signal flow from gesture creation to system
response. The other modules could be replaced with
functionally equivalent systems without changing the structure
of our architecture. For example, instead of a human, a robot
could create the gesture. Alternatively, one could create the
gesture using a stylus, with a graphics tablet replacing the
vision system in sensor module S. The graphics tablet would
output the x and y coordinates to the identification module
I. Similarly, module R could be a robot, one as complex as
a six degree of freedom robot arm or as simple as a stepper
motor based camera platform. The former mechanism
requires a more complex transformation scheme in module
T, while the latter system needs only a simple high level
command generator.

As discussed earlier, the static and dynamic identification
modules contain the majority of the required processing.
Compared to most of the systems developed for gesture
recognition, this system requires relatively little processing
time and memory to identify one gesture feature. This makes
it possible to create a system with the ability to identify
multiple features in parallel. A sophisticated module could
then examine the parallel gesture features and infer some
higher level motion or command.

We claim:

1. A method of dynamic gesture recognition, comprising
the steps of:

storing a dynamic motion model composed of a set of
differential equations, each differential equation
describing a particular dynamic gesture to be recognized,
of the form:

\[ \dot{x} = f(x, \theta) \]

where x is vector describing position and velocity
components, and \theta is a tunable parameter;

capturing the motion to be recognized along with the
tunable parameters associated with a gesture-making
target;

extracting the position and velocity components of the
captured motion; and

identifying the dynamic gesture by determining which
differential equation is solved using the extracted
components and the tunable parameters.

2. The method of claim 1, wherein the target is a human
hand, human head, full body, any body part, or any object in
the motion capturing device's field of view.

3. The method of claim 2, further including the step of
generating a bounding box around the object.

4. The method of claim 1, further including the step of
using an operator to find the edges of the target.

5. The method of claim 1, further including the step of
treating a dynamic gesture as one or more one or
multi-dimensional oscillation.

6. The method of claim 5, further including the step of
creating a circular motion as a combination of repeating
motions in one, two, or three dimensions having the same
magnitude and frequency of oscillation.

7. The method of claim 5, further including the step of
deriving complex dynamic gestures by varying phase and
magnitude relationships.

8. The method of claim 5, further including the step of
deriving a multi-gesture lexicon based upon clockwise and
counter-clockwise large and small circles and one-dimensional
lines.

9. The method of claim 5, further including the step of
comparing to the next position and velocity of each gesture
to one or more predictor bins to determine a gesture's future
position and velocity.

10. The method of claim 9, further including the use of a
velocity damping model to discriminate among non-circular
dynamic gestures.

11. The method of claim 5, further including the use of
dynamic system representation to discriminate among
dynamic motion gestures.

12. A gesture-controlled interface for self-service
machines and other applications, comprising:

a sensor module for capturing and analyzing a gesture
made by a human or machine, and outputting gesture
descriptive data including position and velocity
information associated with the gesture;

an identification module operative to identify the gesture
based upon sensor data output by the sensor module;
and

a transformation module operative to generate a command
based upon the gesture identified by the identification
module.

13. The interface of claim 12, further including a system
response module operative to apply to the command from
the transformation module to the device or software program
to be controlled.

14. The interface of claim 13, wherein the device is a
virtual-reality simulator or game.

15. The interface of claim 13, wherein the device is a
self-service machine.

16. The interface of claim 13, wherein the device forms
part of a robot.

17. The interface of claim 13, wherein the device forms
part of a commercial appliance.

* * * * *